Analysis of Boston data

Overview of data

Boston data is included to R-package as a demonstartion or example.

Dataset contains information about Boston area. It includes following variables:

  • crim = per capita crime rate by town
  • zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus = proportion of non-retail business acres per town
  • chas = Charles River dummy variable
  • nox = nitrogen oxides concentration (parts per 10 million)
  • rm = average number of rooms per dwelling
  • age = proportion of owner-occupied units built prior to 1940
  • dis = weighted mean of distances to five Boston employment centres
  • rad = index of accessibility to radial highways
  • tax = full-value property-tax rate per $10000
  • ptratio = pupil-teacher ratio by town
  • black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • lstat = lower status of the population (percent)
  • medv = median value of owner-occupied homes in $1000s
Structure and the dimensions of the data

Dataset has 14 variables and 506 observationsand all variables are numerical.

'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
[1] 506  14
Summary, graphical presentation of data and correlations

As seen in pairs plot, most of the variables are not normally distributed. Most of them are skewed and some of them are bimodal. Correlations between variables are better viewed in correlation plotting, where on the upper-right side the biggest circles indicate highest correlations (blue = positive or red = negative). Corresponding number values are mirrored on the lower-left side.

      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  


Standardization

In standardization means of all variables are in zero. That is, variables have distributed around zero. This can be seen in summary table (compare with original summary above).

Variable crime rate has been changed to categorical variable with 4 levels: low, med_low, med_high and high. Each class includes quantile of data (25%).

Train and test sets have been created by dividing original (standardized) data to two groups randomly. 80% belongs to train set and 20% to test set.

      crim                 zn               indus        
 Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
 1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
 Median :-0.390280   Median :-0.48724   Median :-0.2109  
 Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
 Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
      chas              nox                rm               age         
 Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
 1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
 Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
 Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
      dis               rad               tax             ptratio       
 Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
 1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
 Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
 Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
     black             lstat              medv        
 Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
 1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
 Median : 0.3808   Median :-0.1811   Median :-0.1449  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
 Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865  

Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.

Call:
lda(crime ~ ., data = train)

Prior probabilities of groups:
      low   med_low  med_high      high 
0.2549505 0.2500000 0.2400990 0.2549505 

Group means:
                 zn      indus          chas        nox         rm
low       0.9687092 -0.9112840 -0.0812076970 -0.8801289  0.4856138
med_low  -0.0752376 -0.2341834  0.0005392655 -0.5493418 -0.1484982
med_high -0.3891090  0.1911363  0.2553235412  0.4195305  0.1531896
high     -0.4872402  1.0170891 -0.0047591488  1.0637606 -0.3778017
                age        dis        rad        tax     ptratio
low      -0.8850018  0.8612329 -0.6964270 -0.7510331 -0.43463934
med_low  -0.3588296  0.3766392 -0.5452263 -0.4622746 -0.06452449
med_high  0.4671902 -0.4076923 -0.4052697 -0.2994846 -0.32517548
high      0.7894869 -0.8439913  1.6384176  1.5142626  0.78111358
               black       lstat        medv
low       0.38230915 -0.78699174  0.56372847
med_low   0.31815664 -0.11298399 -0.02391351
med_high  0.08123998  0.01135132  0.20559787
high     -0.89805403  0.87842110 -0.70334407

Coefficients of linear discriminants:
                 LD1          LD2         LD3
zn       0.075242940  0.639801821 -0.94141064
indus   -0.002056955 -0.181994976  0.45929553
chas    -0.060535759 -0.050434476  0.08313609
nox      0.383305156 -0.766013417 -1.23180420
rm      -0.107459881 -0.059396218 -0.13670332
age      0.235528477 -0.389034630 -0.24579427
dis     -0.088154061 -0.219578569  0.29277345
rad      3.148537133  0.965481440 -0.06910379
tax      0.027648987 -0.010747157  0.46671769
ptratio  0.093643241 -0.009991854 -0.28780485
black   -0.137958980  0.037344629  0.15296989
lstat    0.230846376 -0.179505584  0.45412181
medv     0.191849735 -0.391659595 -0.17507334

Proportion of trace:
   LD1    LD2    LD3 
0.9474 0.0387 0.0140 

Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.

          predicted
correct    low med_low med_high high Sum
  low       13       9        2    0  24
  med_low    3      15        7    0  25
  med_high   1      11       16    1  29
  high       0       0        0   24  24
  Sum       17      35       25   25 102
[1] 0.3333333

Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)

Bonus: Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?

Super-Bonus:

Adjust the code: add argument color as a argument in the plot_ly() function. Set the color to be the crime classes of the train set. Draw another 3D plot where the color is defined by the clusters of the k-means. How do the plots differ? Are there any similarities?

[1] 404  13
[1] 13  3